Abstract. The categorical compositional distributional model of [?] suggests a way
to combine the grammatical composition of formal, type-logical models with the
corpus-based, empirical word representations of distributional semantics. This
paper contributes to the project by expanding the model to also capture entailment
relations. This is achieved by extending the representations of words from points
in meaning space to density operators, which are probability distributions on the
subspaces of the space. A symmetric measure of similarity and an asymmetric
measure of entailment are defined, where lexical entailment is measured using
quantum relative entropy, the quantum variant of Kullback-Leibler divergence.
Lexical entailment, combined with the composition map on word representations,
provides a method to obtain entailment relations on the level of sentences. Truth
theoretic and corpus-based examples are provided.


1 Introduction 


The term distributional semantics is almost synonymous with the term vector space
models of meaning. This is because vector spaces are natural candidates for modelling
the distributional hypothesis and contextual similarity between words [?]. In a
nutshell, this hypothesis says that words that often occur in the same contexts have
similar meanings. So for instance, ‘ale’ and ‘lager’ are similar since they both often
occur in the context of ‘beer’, ‘pub’, and ‘pint’. The obvious practicality of these
models, however, does not guarantee that they possess the expressive power needed
to model all aspects of meaning. Current distributional models mostly fall short of
successfully modelling subsumption and entailment [?]. There are a number of models
that use distributional similarity to enhance textual entailment [?]. However, most of
the work from the distributional semantics community has been focused on developing
more sophisticated metrics on vector representations [?].

In this paper we suggest the use of density matrices instead of vector spaces as 
the basic distributional representations for the meanings of words. Density matrices 
are widely used in quantum mechanics, and are a generalization of vectors. There 
are several advantages to using density matrices to model meaning. Firstly, density 
matrices have the expressive power to represent all the information vectors can 
represent; they are a suitable implementation of the distributional hypothesis. They 
come equipped with a measure of information content, and so provide a natural way 
of implementing asymmetric relations between words such as hyponymy-hypernymy 
relations. Furthermore, they form a compact closed category. This allows the previous
work of [?] on obtaining representations for meanings of sentences from the meaning
of words to be applicable to density matrices. The categorical map from meanings of
words to the meaning of the sentence respects the order induced by the relative entropy
of density matrices. This promises, given suitable representations of individual words,
a method to obtain entailment relations on the level of sentences, in line with the lexical
entailment of natural logic, e.g. see [?], rather than the traditional logical entailment
of Montague semantics.


Related Work. This work builds upon and relates to literature on compositional 
distributional models, distributional lexical entailment, and the use of density matrices 
in computational linguistics and information retrieval. 

There has been recent interest in methods of composition within the distributional
semantics framework. There are a number of composition methods in the literature. See
[?] for a survey of compositional distributional models and a discussion of their
strengths and weaknesses. This work extends the work presented in [?], a compositional
model based on category theory. Their model was shown to outperform the competing
compositional models by [?].

Research on distributional entailment has mostly been focused on lexical entailment.
One notable exception is [?], who use the distributional data on adjective-noun
pairs to train a classifier, which is then utilized to detect novel noun pairs that have
the same relation. There are a number of non-symmetric lexical entailment measures
which all rely on some variation of the Distributional Inclusion Hypothesis:
“If u is semantically narrower than v, then a significant number of salient
distributional features of u are also included in the feature vector of v” [?]. In their
experiments, [?] show that if a word v entails another word w, then the
characteristic features of v are a subset of the ones for w, but it is not necessarily the
case that the inclusion of the characteristic features of v in those of w indicates that
v entails w. One of their suggestions for increasing the prediction power of their
method is to include more than one word in the features.

[22] use a measure based on entropy to detect hyponym-hypernym relationships
in given pairs. The measure they suggest relies on the hypothesis that hypernyms
are semantically more general than hyponyms, and therefore tend to occur in less
informative contexts. [15] rely on a very similar idea, and use KL-divergence between
the target word and the basis words to quantify the semantic content of the target word.
They conclude that this method performs equally well in detecting hyponym-hypernym
pairs as their baseline prediction method that only considers the overall frequency of
the word in the corpus. They reject the hypothesis that more general words occur in less
informative contexts. Their method differs from ours in that they use relative entropy to
quantify the overall information content of a word, and not to compare two target words
to each other.

[21] extend the compositional model of [?] to include density matrices as we do, but
use it for modeling homonymy and polysemy. Their approach is complementary to ours,
and in fact, they show that it is possible to merge the two constructions. [?] use density
matrices to model context effects in a conceptual space. In their quantum mechanics
inspired model, words are represented by mixed states and each eigenstate represents
a sense of the word. Context effects are then modelled as quantum collapse. [?] use
density matrices to encode dependency neighbourhoods, with the aim of modelling
context effects in similarity tasks. [?] uses density matrices to sketch out a theory of
information retrieval, and connects the logic of the space and of density matrices via an
order relation that makes the set of projectors in a Hilbert space into a complete lattice.
He uses this order to define an entailment relation. [?] show that using density matrices
to represent documents provides significant improvement on realistic IR tasks.

This paper is based on the MSc thesis of the first author.


2 Background 

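Definition 1. A monoidal category is compact closed if for any object $A$, there are
left and right dual objects, i.e. objects $A^r$ and $A^l$, and morphisms
$\eta^l : I \to A \otimes A^l$, $\eta^r : I \to A^r \otimes A$,
$\epsilon^l : A^l \otimes A \to I$ and $\epsilon^r : A \otimes A^r \to I$ that satisfy:

\[ (1_A \otimes \epsilon^l) \circ (\eta^l \otimes 1_A) = 1_A \qquad (\epsilon^r \otimes 1_A) \circ (1_A \otimes \eta^r) = 1_A \]

\[ (\epsilon^l \otimes 1_{A^l}) \circ (1_{A^l} \otimes \eta^l) = 1_{A^l} \qquad (1_{A^r} \otimes \epsilon^r) \circ (\eta^r \otimes 1_{A^r}) = 1_{A^r} \]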

Compact closed categories are used to represent correlations, and in categorical
quantum mechanics they model maximally entangled states [?]. The $\eta$ and $\epsilon$ maps are
useful in modeling the interactions of the different parts of a system. To see how this
relates to natural language, consider a simple sentence with an object, a subject and a
transitive verb. The meaning of the entire sentence is not simply an accumulation of
the meanings of its individual words, but depends on how the transitive verb relates the
subject and the object. The $\eta$ and $\epsilon$ maps provide the mathematical formalism to specify
such interactions. The distinct left and right duals ensure that compact closed categories
can take word order into account.

There is a graphical calculus used to reason about monoidal categories [?]. In the
graphical language, objects are wires, and morphisms are boxes with incoming and
outgoing wires of types corresponding to the input and output types of the morphism.
The identity object is depicted as empty space, so a state $\psi : I \to A$ is depicted as a
box with no input wire and an output wire with type $A$. The duals of states are called
effects, and they are of type $A \to I$. Let $f : A \to B$, $g : B \to C$ and $h : C \to D$, and
$1_A : A \to A$ the identity function on $A$. $1_A$, $f$, $f \otimes h$ and $g \circ f$ are depicted as follows:



The state $\psi : I \to A$, the effect $\pi : A \to I$, and the scalar $\pi \circ \psi$ are depicted as follows:

The maps $\eta^l$, $\eta^r$, $\epsilon^l$ and $\epsilon^r$ take the following forms in the graphical calculus:

[Diagrams: the four maps drawn as bent wires whose ends are labelled $A$, $A^l$ and $A^r$.]

The axioms of compact closure, referred to as the snake identities because of the 
visual form they take in the graphical calculus, are represented as follows: 




More generally, the reduction rules for diagrammatic calculus allow continuous 
deformations. One such deformation we will make use of is the swing rule: 



Definition 2. [?] A pregroup $(P, \leq, \cdot, 1, (-)^l, (-)^r)$ is a partially ordered monoid
in which each element $a$ has both a left adjoint $a^l$ and a right adjoint $a^r$ such that
$a^l a \leq 1 \leq a a^l$ and $a a^r \leq 1 \leq a^r a$.


If $a \leq b$ it is common practice to write $a \to b$ and say that $a$ reduces to $b$. This
terminology is useful when pregroups are applied to natural language, where each
word gets assigned a pregroup type freely generated from a set of basic elements. The
sentence is deemed to be grammatical if the concatenation of the types of the words
reduces to the simple type of a sentence. For example, the reduction for a simple transitive
sentence is $n(n^r s n^l)n \to 1 s n^l n \to 1 s 1 \to s$.
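
The grammaticality check described above can be illustrated with a small sketch. The encoding of simple types as pairs `(base, z)` and the function name `reduces_to` are assumptions made for this illustration only; the greedy stack strategy handles examples such as the transitive sentence but is not claimed to decide reductions for arbitrary pregroup types.

```python
# A simple type is a pair (base, z): z = 0 is the basic type,
# z = -1 its left adjoint and z = +1 its right adjoint (hypothetical encoding).
def reduces_to(types, target):
    """Greedy stack reduction using only contractions (a, z)(a, z + 1) -> 1."""
    stack = []
    for t in types:
        if stack and stack[-1][0] == t[0] and stack[-1][1] + 1 == t[1]:
            stack.pop()          # contraction: a^l a <= 1 or a a^r <= 1
        else:
            stack.append(t)
    return stack == list(target)

n, n_r, n_l, s = ("n", 0), ("n", 1), ("n", -1), ("s", 0)
transitive_sentence = [n, n_r, s, n_l, n]      # n (n^r s n^l) n
print(reduces_to(transitive_sentence, [s]))    # the string reduces to s
```

The two contractions performed are $n\, n^r \leq 1$ on the left of the verb and $n^l n \leq 1$ on its right, matching the reduction written above.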

A pregroup $P$ is a concrete instance of a compact closed category. The $\eta^l, \eta^r, \epsilon^l, \epsilon^r$
maps are $\eta^l = [1 \leq p \cdot p^l]$, $\epsilon^l = [p^l \cdot p \leq 1]$, $\eta^r = [1 \leq p^r \cdot p]$, $\epsilon^r = [p \cdot p^r \leq 1]$.

FVect as a concrete compact closed category. Finite dimensional vector spaces over the
base field $\mathbb{R}$, together with linear maps, form a monoidal category, referred to as FVect.
The monoidal tensor is the usual vector space tensor and the monoidal unit is the base
field $\mathbb{R}$. It is also a compact closed category where $V^l = V^r = V$. The compact closed
maps are defined as follows:

Given a vector space $V$ with basis $\{\overrightarrow{e_i}\}_i$,

\[ \eta^l_V = \eta^r_V : \mathbb{R} \to V \otimes V \qquad\qquad \epsilon^l_V = \epsilon^r_V : V \otimes V \to \mathbb{R} \]

\[ 1 \mapsto \sum_i \overrightarrow{e_i} \otimes \overrightarrow{e_i} \qquad\qquad \sum_{ij} c_{ij}\, \overrightarrow{v_i} \otimes \overrightarrow{w_j} \mapsto \sum_{ij} c_{ij} \langle \overrightarrow{v_i} | \overrightarrow{w_j} \rangle \]
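
As a minimal numerical sketch of the two maps of FVect, the snippet below builds $\eta_V$ and $\epsilon_V$ as matrices and checks one snake identity. The dimension `d = 3` and the variable names are arbitrary choices for illustration, and the standard basis is assumed.

```python
import numpy as np

d = 3                                   # dimension of V (arbitrary choice)
I = np.eye(d)

# eta : R -> V (x) V, 1 |-> sum_i e_i (x) e_i, stored as a (d*d, 1) matrix
eta = I.reshape(d * d, 1)
# epsilon : V (x) V -> R, v (x) w |-> <v|w>, stored as a (1, d*d) matrix
eps = I.reshape(1, d * d)

# Snake identity: (1_V (x) eps) o (eta (x) 1_V) = 1_V
snake = np.kron(I, eps) @ np.kron(eta, I)
print(np.allclose(snake, I))

# epsilon applied to a product vector is the inner product
v, w = np.array([1.0, 2.0, 3.0]), np.array([0.5, -1.0, 2.0])
print(np.isclose((eps @ np.kron(v, w))[0], v @ w))
```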


Categorical representation of meaning space. The tensor in FVect is commutative up
to isomorphism. This causes the left and the right adjoints to be the same, and thus
the left and the right compact closed maps coincide. Thus FVect by itself cannot
take the effect of word ordering on meaning into account. [?] propose a way around this
obstacle by considering the product category FVect × P, where P is a pregroup.

Objects in FVect × P are of the form $(V, p)$, where $V$ is the vector space for
the representation of meaning and $p$ is the pregroup type. There exists a morphism
$(f, \leq) : (V, p) \to (W, q)$ if there exists a morphism $f : V \to W$ in FVect and $p \leq q$
in P.

The compact closed structure of FVect and P lifts componentwise to the product
category FVect × P:

\[ \eta^l : (\mathbb{R}, 1) \to (V \otimes V, p \cdot p^l) \qquad \eta^r : (\mathbb{R}, 1) \to (V \otimes V, p^r \cdot p) \]

\[ \epsilon^l : (V \otimes V, p^l \cdot p) \to (\mathbb{R}, 1) \qquad \epsilon^r : (V \otimes V, p \cdot p^r) \to (\mathbb{R}, 1) \]


Definition 3. An object $(V, p)$ in the product category is called a meaning space, where
$V$ is the vector space in which the meanings $\overrightarrow{v} \in V$ of strings of type $p$ live.

Definition 4. From-meanings-of-words-to-the-meaning-of-the-sentence map. Let $v_1 v_2 \ldots v_n$
be a string of words, each $v_i$ with a meaning space representation $\overrightarrow{v_i} \in (V_i, p_i)$. Let
$x \in P$ be a pregroup type such that $[p_1 p_2 \ldots p_n \leq x]$. Then the meaning vector for the
string is $\overrightarrow{v_1 v_2 \ldots v_n} := f(\overrightarrow{v_1} \otimes \overrightarrow{v_2} \otimes \ldots \otimes \overrightarrow{v_n}) \in (W, x)$, where $f$ is defined to be the
application of the compact closed maps obtained from the reduction $[p_1 p_2 \ldots p_n \leq x]$
to the composite vector space $V_1 \otimes V_2 \otimes \ldots \otimes V_n$.

This framework uses the maps of the pregroup reductions and the elements of
objects in FVect. The diagrammatic calculus provides a tool to reason about both. As
an example, take the sentence “John likes Mary”. It has the pregroup type $n\, n^r s n^l\, n$,
and the vector representations $\overrightarrow{John}, \overrightarrow{Mary} \in V$ and $\overrightarrow{likes} \in V \otimes S \otimes V$. The
morphism in FVect × P corresponding to the map defined in Definition 4 is of
type $(V \otimes (V \otimes S \otimes V) \otimes V,\ n n^r s n^l n) \to (S, s)$. From the pregroup reduction
$[n n^r s n^l n \to s]$ we obtain the compact closed maps $\epsilon^r \otimes 1 \otimes \epsilon^l$. In FVect this translates
into $\epsilon_V \otimes 1_S \otimes \epsilon_V : V \otimes (V \otimes S \otimes V) \otimes V \to S$. This map, when applied to
$\overrightarrow{John} \otimes \overrightarrow{likes} \otimes \overrightarrow{Mary}$, has the following depiction in the diagrammatic calculus:
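
Note that this construction treats the verb ‘likes’ essentially as a relation that takes
two inputs of type $V$, and outputs a vector of type $S$. For the explicit calculation, note
that $\overrightarrow{likes} = \sum_{ijk} c_{ijk}\, \overrightarrow{v_i} \otimes \overrightarrow{s_j} \otimes \overrightarrow{v_k}$, where $\{\overrightarrow{v_i}\}_i$ is an orthonormal basis for $V$ and
$\{\overrightarrow{s_j}\}_j$ is an orthonormal basis for $S$. Then

\[ \overrightarrow{John\ likes\ Mary} = (\epsilon_V \otimes 1_S \otimes \epsilon_V)(\overrightarrow{John} \otimes \overrightarrow{likes} \otimes \overrightarrow{Mary}) \tag{1} \]

\[ = \sum_{ijk} c_{ijk}\, \langle \overrightarrow{John} | \overrightarrow{v_i} \rangle\, \overrightarrow{s_j}\, \langle \overrightarrow{v_k} | \overrightarrow{Mary} \rangle \tag{2} \]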
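
The calculation in (1) and (2) can be sketched numerically as follows. The dimensions, the random entries and the variable names are assumptions chosen only for illustration; the standard basis plays the role of the orthonormal bases, so that $\langle John | v_i \rangle$ is simply the $i$-th coordinate of the vector for John.

```python
import numpy as np

rng = np.random.default_rng(0)
dim_v, dim_s = 4, 2                          # illustrative dimensions of V and S

john = rng.random(dim_v)
mary = rng.random(dim_v)
likes = rng.random((dim_v, dim_s, dim_v))    # coefficients c_{ijk}

# Equation (2): sum_{ijk} c_{ijk} <John|v_i> s_j <v_k|Mary>
sentence = np.einsum("i,ijk,k->j", john, likes, mary)

# Equation (1): apply eps_V (x) 1_S (x) eps_V to John (x) likes (x) Mary
eps = np.eye(dim_v).reshape(1, dim_v * dim_v)
f = np.kron(np.kron(eps, np.eye(dim_s)), eps)
full = np.kron(np.kron(john, likes.reshape(-1)), mary)

print(np.allclose(f @ full, sentence))       # both routes agree
```

The `einsum` form avoids building the large tensor product explicitly, which mirrors the simplification that the diagrammatic reductions provide.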


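The reductions in the diagrammatic calculus help reduce the final calculation to a
simpler term. The non-reduced expression, written in Dirac notation, is
$(\langle \epsilon_V | \otimes 1_S \otimes \langle \epsilon_V |) \circ | \overrightarrow{John} \otimes \overrightarrow{likes} \otimes \overrightarrow{Mary} \rangle$. But we can swing John and Mary in accord
with the reduction rules in the diagrammatic calculus. The diagram then reduces to:



This results in a simpler expression that needs to be calculated:
$(\langle \overrightarrow{John} | \otimes 1_S \otimes \langle \overrightarrow{Mary} |) \circ | \overrightarrow{likes} \rangle$.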


3 Density Matrices as Elements of a Compact Closed Category 

Recall that in FVect, vectors $\overrightarrow{v} \in V$ are in one-to-one correspondence with
morphisms of type $v : I \to V$. Likewise, pure states of the form $|v\rangle\langle v|$ are in
one-to-one correspondence with morphisms $v \circ v^\dagger : V \to V$ such that $v^\dagger \circ v = \mathrm{id}_I$,
where $v^\dagger$ denotes the adjoint of $v$ (notice that this corresponds to the condition that
$\langle v | v \rangle = 1$). A general (mixed) state $\rho$ is a positive morphism of the form $\rho : V \to V$.
One can re-express the mixed states $\rho : V \to V$ as elements $\rho : I \to V^* \otimes V$. Here
$V^* = V^l = V^r = V$.

Definition 5. $f$ is a completely positive map if $f(A)$ is positive for any positive operator
$A$, and $(\mathrm{id}_V \otimes f)B$ is positive for any positive operator $B$ and any space $V$.

Completely positive maps in FVect form a monoidal category [?]. Thus one can
define a new category CPM(FVect) where the objects of CPM(FVect) are the
same as those of FVect, and morphisms $A \to B$ in CPM(FVect) are completely
positive maps $A^* \otimes A \to B^* \otimes B$ in FVect. The elements $I \to A$ in CPM(FVect)
are of the form $I^* \otimes I \to A^* \otimes A$ in FVect, providing a monoidal category with density
matrices as its elements.

CPM(FVect) in graphical calculus. A morphism $\rho : A \to A$ is positive if and only if
there exists a map $\sqrt{\rho}$ such that $\rho = \sqrt{\rho}^\dagger \circ \sqrt{\rho}$. In FVect, the isomorphism between
$\rho : A \to A$ and $\ulcorner \rho \urcorner : I \to A^* \otimes A$ is provided by $\eta^l = \eta^r$. The graphical representation
of $\rho$ in FVect then becomes:

[Diagrams: graphical representation of $\rho$ in FVect.]

Notice that the categorical definition of a positive morphism coincides with the
definition of a positive operator in a vector space, where $\sqrt{\rho}$ is the square root of the
operator.

The graphical depiction of completely positive morphisms comes from the following
theorem:


Theorem 1. (Stinespring Dilation Theorem) $f : A^* \otimes A \to B^* \otimes B$ is completely
positive if and only if there is an object $C$ and a morphism $\sqrt{f} : A \to C \otimes B$ such that
the following equation holds:

[Diagram: $f$, with input wires $A^*$, $A$ and output wires $B^*$, $B$, expressed in terms of $\sqrt{f}$.]

$\sqrt{f}$ and $C$ here are not unique. For the proof of the theorem see [23].


Theorem 2. CPM(FVect) is a compact closed category where, as in FVect, $V^r =
V^l = V$ and the compact closed maps are defined to be:

\[ \eta^l = (1_V \otimes \sigma \otimes 1_V) \circ (\eta^r_V \otimes \eta^l_V) \qquad \eta^r = (1_V \otimes \sigma \otimes 1_V) \circ (\eta^l_V \otimes \eta^r_V) \]

\[ \epsilon^l = (\epsilon^r_V \otimes \epsilon^l_V) \circ (1_V \otimes \sigma \otimes 1_V) \qquad \epsilon^r = (\epsilon^l_V \otimes \epsilon^r_V) \circ (1_V \otimes \sigma \otimes 1_V) \]

where $\sigma$ is the swap map defined as $\sigma(v \otimes w) = (w \otimes v)$.


Proof. The graphical construction of the compact closed maps boils down to doubling
the objects and the wires. The identities are proved by adding bends in the wires.
Consider the diagram for $\eta^l$:



These maps satisfy the axioms of compact closure since the components do.
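
The concrete compact closed maps are as follows:

\[ \eta^l = \eta^r : \mathbb{R} \to (V \otimes V) \otimes (V \otimes V) :: 1 \mapsto \sum_{ij} \overrightarrow{e_i} \otimes \overrightarrow{e_j} \otimes \overrightarrow{e_i} \otimes \overrightarrow{e_j} \]

\[ \epsilon^l = \epsilon^r : (V \otimes V) \otimes (V \otimes V) \to \mathbb{R} :: \sum_{ijkl} c_{ijkl}\, \overrightarrow{v_i} \otimes \overrightarrow{w_j} \otimes \overrightarrow{u_k} \otimes \overrightarrow{p_l} \mapsto \sum_{ijkl} c_{ijkl}\, \langle \overrightarrow{v_i} | \overrightarrow{u_k} \rangle \langle \overrightarrow{w_j} | \overrightarrow{p_l} \rangle \]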


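Let $\rho : V_1 \otimes V_2 \otimes \ldots \otimes V_n \to V_1 \otimes V_2 \otimes \ldots \otimes V_n$ be a density operator defined
on an arbitrary composite space $V_1 \otimes V_2 \otimes \ldots \otimes V_n$. Then it has the density matrix
representation $\rho : I \to (V_1 \otimes V_2 \otimes \ldots \otimes V_n)^* \otimes (V_1 \otimes V_2 \otimes \ldots \otimes V_n)$. Since the
underlying category FVect is symmetric, it has the swap map $\sigma$. This provides us with
the isomorphism:

\[ (V_1 \otimes V_2 \otimes \ldots \otimes V_n)^* \otimes (V_1 \otimes V_2 \otimes \ldots \otimes V_n) \cong (V_1^* \otimes V_1) \otimes (V_2^* \otimes V_2) \otimes \ldots \otimes (V_n^* \otimes V_n) \]

So $\rho$ can be equivalently expressed as $\rho : I \to (V_1^* \otimes V_1) \otimes (V_2^* \otimes V_2) \otimes \ldots \otimes (V_n^* \otimes V_n)$.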
With this addition, we can simplify the diagrams used to express density matrices by 
using a single thick wire for the doubled wires. Doubled compact closed maps can 
likewise be expressed by a single thick wire. 




The diagrammatic expression of a from-meanings-of-words-to-the-meaning-of-the- 
sentence map using density matrices will therefore look exactly like the depiction of it 
in FVect, but with thick wires. 


4 Using Density Matrices to Model Meaning 


If one wants to use the full power of density matrices in modelling meaning, one needs 
to establish an interpretation for the distinction between mixing and superposition in the 
context of linguistics. Let contextual features be the salient, quantifiable features of the 
contexts a word is observed in. Let the basis of the space be indexed by such contextual 
features. Individual contexts, such as words in an n-word window of a text, can be 
represented as the superposition of the bases corresponding to the contextual features 
observed in it. So each context corresponds to a pure state. Words are then probability 
distributions over the contexts they appear in. The simple co-occurrence model can be 
cast as a special case of this more general approach, where features and contexts are the 
same. Then all word meanings are mixtures of basis vectors, and they all commute with 
each other. 

Similarity for density matrices. Fidelity is a good measure of similarity between two
density matrix representations of meaning because of its properties listed below.

Definition 6. The fidelity of two density operators $\rho$ and $\sigma$ is $F(\rho, \sigma) := \mathrm{tr}\sqrt{\rho^{1/2} \sigma \rho^{1/2}}$.

Some useful properties of fidelity are:

1. $F(\rho, \sigma) = F(\sigma, \rho)$

2. $0 \leq F(\rho, \sigma) \leq 1$

3. $F(\rho, \sigma) = 1$ if and only if $\rho = \sigma$

4. If $|\phi\rangle\langle\phi|$ and $|\psi\rangle\langle\psi|$ are two pure states, their fidelity is equal to $|\langle \phi | \psi \rangle|$.

These properties ensure that if the representations of two words are not equal to
each other, they will not be judged perfectly similar, and, if two words are represented
as projections onto one dimensional subspaces, their similarity value will be equal to
the absolute value of the usual cosine similarity of the vectors.
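
A minimal sketch of Definition 6 in NumPy is given below. The helper names `psd_sqrt`, `fidelity` and `pure` are hypothetical, and the matrix square root is computed from an eigendecomposition, which assumes the inputs are Hermitian and positive semidefinite.

```python
import numpy as np

def psd_sqrt(m):
    """Square root of a positive semidefinite Hermitian matrix."""
    vals, vecs = np.linalg.eigh(m)
    vals = np.clip(vals, 0.0, None)       # remove tiny negative round-off
    return (vecs * np.sqrt(vals)) @ vecs.conj().T

def fidelity(rho, sigma):
    """F(rho, sigma) = tr sqrt(rho^{1/2} sigma rho^{1/2})."""
    root = psd_sqrt(rho)
    return float(np.real(np.trace(psd_sqrt(root @ sigma @ root))))

def pure(v):
    """Density matrix |v><v| of the normalised vector v."""
    v = np.asarray(v, dtype=float)
    v = v / np.linalg.norm(v)
    return np.outer(v, v)

phi, psi = np.array([1.0, 0.0]), np.array([1.0, 1.0])
cosine = abs(phi @ psi) / (np.linalg.norm(phi) * np.linalg.norm(psi))
print(fidelity(pure(phi), pure(psi)), cosine)   # the two values agree up to round-off
```

For pure states the function reproduces property 4, so the fidelity of two one dimensional representations coincides with the absolute cosine of the angle between the underlying vectors.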

Entailment for density matrices. To develop a theory of entailment using density 
matrices as the basic representations, we assume the following hypothesis: 

Definition 7 (Distributional Hypothesis for Hyponymy). The meaning of a word w 
subsumes the meaning of a word v if and only if it is appropriate to use w in all the 
contexts v is used. 

This is a slightly more general version of the Distributional Inclusion Hypothesis
(DIH) stated in [?]. The difference lies in the additional power the density matrix
formalism provides: the distinction between mixing and superposition. Further, DIH
only considers whether or not the target word occurs together with the salient
distributional feature at all, and ignores any possible statistically significant correlations
of features; here again, the density matrix formalism offers a solution.

Note that [?] show that while there is ample evidence for the distributional
inclusion hypothesis, this in itself does not necessarily provide a method to detect
hyponymy-hypernymy pairs. One of their suggestions for improvement is to consider
more than one word in the features, equivalent to what we do here by taking correlations
into account in a co-occurrence space where the bases are context words.

Relative entropy quantifies the distinguishability of one distribution from another.
The idea of using relative entropy to model hyponymy is based on the assumption that
the distinguishability of one word from another given its usual contexts provides us with
a good metric for hyponymy. For example, if one is given a sentence with the word
dog crossed out, it will not be possible to know for sure that the crossed out
word is not animal just from the context (except perhaps in very particular declarative
sentences which rely on world knowledge, such as ‘All - bark’).

Definition 8. The (quantum) relative entropy of two density matrices $\rho$ and $\sigma$ is
$N(\rho \| \sigma) := \mathrm{tr}(\rho \log \rho) - \mathrm{tr}(\rho \log \sigma)$, where $0 \log 0 = 0$ and $x \log 0 = \infty$ when $x \neq 0$
by convention.

Definition 9. The representativeness between $\rho$ and $\sigma$ is $R(\rho, \sigma) := 1/(1 + N(\rho \| \sigma))$,
where $N(\rho \| \sigma)$ is the quantum relative entropy between $\rho$ and $\sigma$.

Quantum relative entropy is always non-negative. For two density matrices $\rho$ and
$\sigma$, $N(\rho \| \sigma) = \infty$ if $\mathrm{supp}(\rho) \cap \ker(\sigma) \neq 0$, and is finite otherwise. The following is a
direct consequence of these properties:

Corollary 1. For all density matrices $\rho$ and $\sigma$, $R(\rho, \sigma) \leq 1$ with equality if and only if
$\rho = \sigma$, and $0 \leq R(\rho, \sigma)$ with equality if and only if $\mathrm{supp}(\rho) \cap \ker(\sigma) \neq 0$.

The second part of the corollary reflects the idea that if there is a context in which
it is appropriate to use v but not w, then v is perfectly distinguishable from w. Such
contexts are exactly those that fall within $\mathrm{supp}(\rho) \cap \ker(\sigma)$.
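
The two measures can be sketched as below. The function names and the tolerance `tol` are assumptions made for this illustration, logarithms are taken in base 2, and the relative entropy is treated as infinite whenever part of $\rho$ lies outside the support of $\sigma$.

```python
import numpy as np

def support_projector(m, tol=1e-10):
    """Projector onto the span of eigenvectors with non-zero eigenvalue."""
    vals, vecs = np.linalg.eigh(m)
    keep = vecs[:, vals > tol]
    return keep @ keep.conj().T

def trace_log(a, b, tol=1e-10):
    """tr(a log2 b), with the logarithm restricted to the support of b."""
    vals, vecs = np.linalg.eigh(b)
    mask = vals > tol
    v = vecs[:, mask]
    return float(np.real(np.trace(a @ (v * np.log2(vals[mask])) @ v.conj().T)))

def relative_entropy(rho, sigma, tol=1e-10):
    """N(rho||sigma) in bits; infinite unless supp(rho) lies in supp(sigma)."""
    outside = np.eye(len(rho)) - support_projector(sigma, tol)
    if np.real(np.trace(rho @ outside)) > tol:
        return np.inf
    return trace_log(rho, rho, tol) - trace_log(rho, sigma, tol)

def representativeness(rho, sigma):
    return 1.0 / (1.0 + relative_entropy(rho, sigma))

# A pure state and an even mixture that contains it (illustrative inputs)
narrow = np.diag([1.0, 0.0])
broad = np.diag([0.5, 0.5])
print(representativeness(narrow, broad), representativeness(broad, narrow))
```

With these inputs the first value is 1/2 and the second is 0, which illustrates the asymmetry of the measure.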


Characterizing hyponyms. The quantitative measure on density matrices given by
representativeness provides a qualitative preorder on meaning representations as follows:

\[ \rho \prec \sigma \ \text{ if } \ R(\rho, \sigma) > 0 \]
\[ \rho \sim \sigma \ \text{ if } \ \rho \prec \sigma \text{ and } \sigma \prec \rho \]

Proposition 1. The following are equivalent:

1. $\rho \prec \sigma$

2. $\mathrm{supp}(\rho) \subseteq \mathrm{supp}(\sigma)$

3. There exists a positive operator $\rho'$ and $p > 0$ such that $\sigma = p\rho + \rho'$

Proof. (1) ⇒ (2) and (2) ⇒ (1) follow directly from Corollary 1.

(2) ⇒ (3) since $\mathrm{supp}(\rho) \subseteq \mathrm{supp}(\sigma)$ implies that there exists a $p > 0$ such that
$\sigma - p\rho$ is positive. Setting $\rho' = \sigma - p\rho$ gives the desired equality.

(3) ⇒ (2) since $p > 0$, and so $\mathrm{supp}(\rho) \subseteq \mathrm{supp}(\sigma) = \mathrm{supp}(p\rho + \rho')$.
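
The step from (2) to (3) can be made concrete with a short sketch. The function name `witness` and the particular choice of $p$ (the smallest non-zero eigenvalue of $\sigma$ divided by the largest eigenvalue of $\rho$) are assumptions of this illustration; any smaller positive $p$ would also work.

```python
import numpy as np

def witness(rho, sigma, tol=1e-10):
    """Return (p, rho_prime) with sigma = p * rho + rho_prime and rho_prime
    positive, or None when supp(rho) is not contained in supp(sigma)."""
    s_vals, s_vecs = np.linalg.eigh(sigma)
    ker = s_vecs[:, s_vals <= tol]                 # basis of ker(sigma)
    if np.real(np.trace(ker.conj().T @ rho @ ker)) > tol:
        return None                                # rho has weight in ker(sigma)
    p = s_vals[s_vals > tol].min() / np.linalg.eigvalsh(rho).max()
    return p, sigma - p * rho

narrow = np.diag([1.0, 0.0])       # illustrative pure state
broad = np.diag([0.5, 0.5])        # illustrative mixture containing it

p, rho_prime = witness(narrow, broad)
print(p, np.linalg.eigvalsh(rho_prime).min() >= -1e-12)   # rho_prime is positive
print(witness(broad, narrow))                              # no decomposition exists
```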

The equivalence relation $\sim$ groups any two density matrices $\rho$ and $\sigma$ with $\mathrm{supp}(\rho) =
\mathrm{supp}(\sigma)$ into the same equivalence class, and thus maps the set of density matrices on
a Hilbert space $\mathcal{H}$ onto the set of projections on $\mathcal{H}$. The projections are in one-to-one
correspondence with the subspaces of $\mathcal{H}$ and they form an orthomodular lattice,
providing a link to the logical structure of the Hilbert space that [?] aims to exploit by
using density matrices in IR.

Let $\hat{v}$ and $\hat{w}$ be density matrix representations of the words v and w. Then v is a
hyponym of w in this model if $\hat{v} \prec \hat{w}$ and $\hat{v} \nsim \hat{w}$.

Notice that even though this ordering on density matrices extracts a yes/no answer
for the question “is v a hyponym of w?”, the existence of the quantitative measure
lets us also quantify the extent to which v is a hyponym of w. This provides some
flexibility in characterizing hyponymy through density matrices in practice. Instead of
calling v a hyponym of w even when $R(\hat{v}, \hat{w})$ gets arbitrarily small, one can require the
representativeness to be above a certain threshold $\epsilon$. This modification, however, has the
downside of causing the transitivity of hyponymy to fail.

5 From meanings of words to the meanings of sentences passage

As in the case for FVect x P, CPM(FVect) x P is a compact closed category, 
where the compact closed maps of CPM(FVect) and P lift component-wise to the 
product category. 

Definition 10. A meaning space in this new category is a pair $(V^* \otimes V, p)$ where
$V^* \otimes V$ is the space in which density matrices $\hat{v} : I \to V^* \otimes V$ of the pregroup type $p$
live.

Definition 11. Let $v_1 v_2 \ldots v_n$ be a string of words, each $v_i$ with a meaning space
representation $\hat{v}_i \in (V_i^* \otimes V_i, p_i)$. Let $x \in P$ be a pregroup type such that $[p_1 p_2 \ldots p_n \leq
x]$. Then the meaning density matrix for the string is defined as:

\[ \widehat{v_1 v_2 \ldots v_n} := f(\hat{v}_1 \otimes \hat{v}_2 \otimes \ldots \otimes \hat{v}_n) \in (W^* \otimes W, x) \]

where $f$ is defined to be the application of the compact closed maps obtained from the
reduction $[p_1 p_2 \ldots p_n \leq x]$ to the composite density matrix space $(V_1^* \otimes V_1) \otimes (V_2^* \otimes
V_2) \otimes \ldots \otimes (V_n^* \otimes V_n)$.

From a high level perspective, the reduction diagrams for CPM(FVect) × P look
no different from the original diagrams for FVect × P, except that we depict them with
thick instead of thin wires. Consider the previous example: “John likes Mary”. It has
the pregroup type $n(n^r s n^l)n$, and the compact closed maps obtained from the pregroup
reduction are $(\epsilon^r \otimes 1 \otimes \epsilon^l)$.

One can also depict the diagram together with the internal anatomy of the density 
representations in FVect: 



The graphical reductions for compact closed categories can be applied to the
diagram, establishing $(\epsilon^r \otimes 1 \otimes \epsilon^l)(\widehat{John} \otimes \widehat{likes} \otimes \widehat{Mary}) = (\widehat{John} \otimes 1 \otimes \widehat{Mary}) \circ \widehat{likes}$.

As formalised in natural logic, one expects the following: if the subject and object of
a sentence are common nouns which, together with the verb of the sentence, are
upward monotone, and these words are replaced by their hyponyms, then the
meanings of the original and the modified sentences preserve this hyponymy.
The following proposition shows that the sentence meaning map for simple transitive
sentences achieves exactly that.







E. Balkir, M. Sadrzadeh & B. Coecke 


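Theorem 3. If $\rho, \sigma, \delta, \gamma \in (N^* \otimes N, n)$, $\alpha, \beta \in (N^* \otimes N \otimes S^* \otimes S \otimes N^* \otimes N, n^r s n^l)$,
$\rho \prec \sigma$, $\delta \prec \gamma$ and $\alpha \prec \beta$, then

\[ f(\rho \otimes \alpha \otimes \delta) \prec f(\sigma \otimes \beta \otimes \gamma) \]

where $f$ is the from-meanings-of-words-to-the-meaning-of-the-sentence map in
Definition 11.

Proof. If $\rho \prec \sigma$, $\delta \prec \gamma$ and $\alpha \prec \beta$, then by Proposition 1 there exist a positive
operator $\rho'$ and $r > 0$ such that $\sigma = r\rho + \rho'$, a positive operator $\delta'$ and $d > 0$ such that
$\gamma = d\delta + \delta'$, and a positive operator $\alpha'$ and $a > 0$ such that $\beta = a\alpha + \alpha'$. Then

\[ f(\sigma \otimes \beta \otimes \gamma) = (\epsilon^r \otimes 1 \otimes \epsilon^l)(\sigma \otimes \beta \otimes \gamma) \]
\[ = (\sigma \otimes 1 \otimes \gamma) \circ \beta \]
\[ = ((r\rho + \rho') \otimes 1 \otimes (d\delta + \delta')) \circ (a\alpha + \alpha') \]
\[ = (r\rho \otimes 1 \otimes d\delta) \circ (a\alpha + \alpha') + (r\rho \otimes 1 \otimes \delta' + \rho' \otimes 1 \otimes d\delta + \rho' \otimes 1 \otimes \delta') \circ (a\alpha + \alpha') \]
\[ = (r\rho \otimes 1 \otimes d\delta) \circ a\alpha + (r\rho \otimes 1 \otimes d\delta) \circ \alpha' + (r\rho \otimes 1 \otimes \delta' + \rho' \otimes 1 \otimes d\delta + \rho' \otimes 1 \otimes \delta') \circ (a\alpha + \alpha'), \]
\[ f(\rho \otimes \alpha \otimes \delta) = (\rho \otimes 1 \otimes \delta) \circ \alpha. \]

The first term of the last expansion is $rda \cdot f(\rho \otimes \alpha \otimes \delta)$ and the remaining terms are positive.
Since $r, d, a \neq 0$, $\mathrm{supp}(f(\rho \otimes \alpha \otimes \delta)) \subseteq \mathrm{supp}(f(\sigma \otimes \beta \otimes \gamma))$, which by Proposition 1
proves the theorem.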

6 Truth Theoretic Examples 

We present several examples that demonstrate the application of the from-meanings-of-
words-to-the-meaning-of-sentence map, where the initial meaning representations of
words are density matrices, and explore how the hierarchy on nouns induced by their
density matrix representations carries over to a hierarchy in the sentence space.

6.1 Entailment between nouns 

Let “lions”, “sloths”, “plants” and “meat” have one dimensional representations in the
noun space of our model:

\[ lions = |lions\rangle\langle lions| \qquad sloths = |sloths\rangle\langle sloths| \]

\[ meat = |meat\rangle\langle meat| \qquad plants = |plants\rangle\langle plants| \]

Let the representation of “mammals” be a mixture of one dimensional representations
of individual animals:

\[ mammals = \tfrac{1}{2}|lions\rangle\langle lions| + \tfrac{1}{2}|sloths\rangle\langle sloths| \]

Notice that, with logarithms taken in base 2,

\[ N(lions \| mammals) = \mathrm{tr}(lions \log lions) - \mathrm{tr}(lions \log mammals) = \log 1 - \log \tfrac{1}{2} = 1 \]

Hence $R(lions, mammals) = 1/2$. For the other direction, since the intersection of the
support of mammals and the kernel of lions is non-trivial, $R(mammals, lions) = 0$.
This confirms that $lions \prec mammals$.


6.2 Entailment between sentences in one dimensional truth theoretic space 


Consider a sentence space that is one dimensional, where 1 stands for true and 0 for
false. Let sloths eat plants and lions eat meat; this is represented as follows:

\[ eat = (|sloths\rangle|plants\rangle + |lions\rangle|meat\rangle)(\langle sloths|\langle plants| + \langle lions|\langle meat|) \]

\[ = (|sloths\rangle\langle sloths| \otimes |plants\rangle\langle plants|) + (|sloths\rangle\langle lions| \otimes |plants\rangle\langle meat|) + (|lions\rangle\langle sloths| \otimes |meat\rangle\langle plants|) + (|lions\rangle\langle lions| \otimes |meat\rangle\langle meat|) \]

The above is the density matrix representation of a pure composite state that relates
“sloths” to “plants” and “lions” to “meat”. If we fix the bases {lions, sloths} for $N_1$,
and {meat, plants} for $N_2$, we will have $eat : N_1 \otimes N_1 \to N_2 \otimes N_2$ with the following
matrix representation:

\[ \begin{pmatrix} 1 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 \end{pmatrix} \]


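“Lions eat meat”. This is a transitive sentence, so as before, it has the pregroup type
$n n^r s n^l n$. Explicit calculations for its meaning give:

\[ (\epsilon^r_N \otimes 1_S \otimes \epsilon^l_N)(lions \otimes eat \otimes meat) \]
\[ = \langle lions|sloths\rangle^2 \langle plants|meat\rangle^2 + \langle lions|sloths\rangle \langle lions|lions\rangle \langle meat|meat\rangle \langle plants|meat\rangle \]
\[ \quad + \langle lions|lions\rangle \langle lions|sloths\rangle \langle meat|meat\rangle \langle plants|meat\rangle + \langle lions|lions\rangle^2 \langle meat|meat\rangle^2 \]
\[ = 1 \]

“Sloths eat meat”. This sentence has a very similar calculation to the one above with
the resulting meaning:

\[ (\epsilon^r_N \otimes 1_S \otimes \epsilon^l_N)(sloths \otimes eat \otimes meat) = 0 \]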

“Mammals eat meat”. This sentence has the following meaning calculation:

\[ (\epsilon^r_N \otimes 1_S \otimes \epsilon^l_N)(mammals \otimes eat \otimes meat) \]
\[ = (\epsilon^r_N \otimes 1_S \otimes \epsilon^l_N)((\tfrac{1}{2} lions + \tfrac{1}{2} sloths) \otimes eat \otimes meat) \]
\[ = \tfrac{1}{2} (\epsilon^r_N \otimes 1_S \otimes \epsilon^l_N)(lions \otimes eat \otimes meat) + \tfrac{1}{2} (\epsilon^r_N \otimes 1_S \otimes \epsilon^l_N)(sloths \otimes eat \otimes meat) \]
\[ = \tfrac{1}{2} \]

The resulting meaning of this sentence is a mixture of “lions eat meat”, which is
true, and “sloths eat meat”, which is false. Thus the value 1/2 can be interpreted as
being neither completely true nor completely false; the sentence “mammals eat meat” is
true for certain mammals and false for others.
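
The three one dimensional calculations can be reproduced with a short NumPy sketch. The basis ordering (lions, sloths) and (meat, plants), the helper `dm` and the function `sentence` are assumptions of this illustration; the verb is stored with one index per wire so that the contraction with the subject and object is a single `einsum`.

```python
import numpy as np

lions_v, sloths_v = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # basis of N1
meat_v, plants_v = np.array([1.0, 0.0]), np.array([0.0, 1.0])    # basis of N2

def dm(v):
    """Density matrix |v><v| of a vector v."""
    return np.outer(v, v)

# eat = |sloths plants> + |lions meat>, as an (unnormalised) pure state
eat_v = np.kron(sloths_v, plants_v) + np.kron(lions_v, meat_v)
eat_matrix = dm(eat_v)
eat = eat_matrix.reshape(2, 2, 2, 2)     # indices: subj, obj, subj', obj'

def sentence(subj, obj):
    # contract the subject and object wires of the verb with the nouns
    return float(np.einsum("ac,abcd,bd->", subj, eat, obj))

lions, sloths, meat = dm(lions_v), dm(sloths_v), dm(meat_v)
mammals = 0.5 * lions + 0.5 * sloths

print(eat_matrix)
print(sentence(lions, meat), sentence(sloths, meat), sentence(mammals, meat))
```

The printed values are 1.0, 0.0 and 0.5, in agreement with the three calculations above.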

6.3 Entailment between sentences in two dimensional truth theoretic space 

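The two dimensional truth theoretic space is set as follows:

\[ true = |0\rangle = \begin{pmatrix} 1 \\ 0 \end{pmatrix} \qquad false = |1\rangle = \begin{pmatrix} 0 \\ 1 \end{pmatrix} \]

The corresponding true and false density matrices are $|0\rangle\langle 0|$ and $|1\rangle\langle 1|$.

In the two dimensional space, the representation of “eats” is set as follows. Let
$A = \{lions, sloths\}$ and $B = \{meat, plants\}$, then

\[ eat = \sum_{\substack{a_1, a_2 \in A \\ b_1, b_2 \in B}} |a_1\rangle\langle a_2| \otimes |x\rangle\langle x| \otimes |b_1\rangle\langle b_2| \]

where

\[ |x\rangle \equiv \begin{cases} |0\rangle & \text{if } |a_1\rangle|b_1\rangle,\ |a_2\rangle|b_2\rangle \in \{|lions\rangle|meat\rangle,\ |sloths\rangle|plants\rangle\} \\ |1\rangle & \text{otherwise} \end{cases} \]

The generalized matrix representation of this verb in the spirit of [13] is:

\[ \left( \begin{array}{cccc|cccc} 1 & 0 & 0 & 1 & 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\ 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\ 1 & 0 & 0 & 1 & 0 & 1 & 1 & 0 \end{array} \right) \]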


“Lions eat meat”. The calculation for the meaning of this sentence is almost exactly
the same as in the one dimensional case, only the result is not the scalar
that stands for true but its density matrix:

\[ (\epsilon^r_N \otimes 1_S \otimes \epsilon^l_N)(lions \otimes eat \otimes meat) = |0\rangle\langle 0| \]

“Sloths eat meat”. Likewise, the calculation for the meaning of this sentence returns
false:

\[ (\epsilon^r_N \otimes 1_S \otimes \epsilon^l_N)(sloths \otimes eat \otimes meat) = |1\rangle\langle 1| \]

“Mammals eat meat”. As we saw before, this sentence has the meaning that is the
mixture of “Lions eat meat” and “Sloths eat meat”; here, this is expressed as follows:

\[ (\epsilon^r_N \otimes 1_S \otimes \epsilon^l_N)(mammals \otimes eat \otimes meat) \]
\[ = \tfrac{1}{2} (\epsilon^r_N \otimes 1_S \otimes \epsilon^l_N)(lions \otimes eat \otimes meat) + \tfrac{1}{2} (\epsilon^r_N \otimes 1_S \otimes \epsilon^l_N)(sloths \otimes eat \otimes meat) \]
\[ = \tfrac{1}{2}|1\rangle\langle 1| + \tfrac{1}{2}|0\rangle\langle 0| \]

So in a two dimensional truth theoretic model, “Mammals eat meat” gives the completely 
mixed state in the sentence space, which has maximal entropy. This is equivalent to 
saying that we have no real knowledge whether mammals in general eat meat or not. 
Even if we are completely certain about whether the individual mammals that span our 
space for “mammals” eat meat, this information differs uniformly among the members 
of the class, so we cannot generalize. 
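The three sentence meanings above, and the claim about maximal entropy, can be reproduced with a short sketch. It assumes that the contraction with a subject and an object density matrix amounts to summing the matching coefficients of `eat`, and it represents density matrices as dictionaries keyed by basis labels. The function names are hypothetical, and the entropy is taken in bits.

```python
import math
from itertools import product

TRUE_PAIRS = {("lions", "meat"), ("sloths", "plants")}


def eat_coefficient(a1, a2, x1, x2, b1, b2):
    """Coefficient of |a1><a2| (x) |x1><x2| (x) |b1><b2| in the density matrix of eat."""
    x = 0 if (a1, b1) in TRUE_PAIRS and (a2, b2) in TRUE_PAIRS else 1
    return 1.0 if (x1, x2) == (x, x) else 0.0


def sentence_meaning(subject, obj):
    """Contract subject and object density matrices with eat.

    Inputs map (row label, column label) to an entry. The result is a
    2x2 matrix over the sentence basis, index 0 = true and 1 = false.
    """
    result = [[0.0, 0.0], [0.0, 0.0]]
    for (a1, a2), s in subject.items():
        for (b1, b2), o in obj.items():
            for x1, x2 in product(range(2), repeat=2):
                result[x1][x2] += s * o * eat_coefficient(a1, a2, x1, x2, b1, b2)
    return result


def entropy_bits_diagonal(rho):
    """Von Neumann entropy in bits. Only valid for diagonal rho, as is the case here."""
    return -sum(p * math.log2(p) for p in (rho[0][0], rho[1][1]) if p > 0)


lions = {("lions", "lions"): 1.0}
sloths = {("sloths", "sloths"): 1.0}
mammals = {("lions", "lions"): 0.5, ("sloths", "sloths"): 0.5}
meat = {("meat", "meat"): 1.0}

if __name__ == "__main__":
    for name, noun in [("lions", lions), ("sloths", sloths), ("mammals", mammals)]:
        rho = sentence_meaning(noun, meat)
        print(name, rho, entropy_bits_diagonal(rho))
```

For a two dimensional sentence space the entropy cannot exceed one bit, which is the value the mixture for “mammals” attains.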

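Already with a two dimensional truth theoretic model, the relation 
$\mathit{lions} \prec \mathit{mammals}$ carries over to sentences. To see this, first note that we have 

$$\begin{aligned}
N(\text{lions eat meat} \,\|\, \text{mammals eat meat}) &= N\big(|0\rangle\langle 0| \,\big\|\, \tfrac{1}{2}|0\rangle\langle 0| + \tfrac{1}{2}|1\rangle\langle 1|\big) \\
&= \mathrm{tr}\big(|0\rangle\langle 0| \log(|0\rangle\langle 0|)\big) - \mathrm{tr}\big(|0\rangle\langle 0| \log(\tfrac{1}{2}|0\rangle\langle 0| + \tfrac{1}{2}|1\rangle\langle 1|)\big) \\
&= 1
\end{aligned}$$

In the other direction, we have $N(\text{mammals eat meat} \,\|\, \text{lions eat meat}) = \infty$, since the 
intersection of the support of the first argument and the kernel of the second argument 
is non-trivial. These lead to the following representativeness results between sentences: 

$$R(\text{lions eat meat}, \text{mammals eat meat}) = 1/2$$

$$R(\text{mammals eat meat}, \text{lions eat meat}) = 0$$

As a result we obtain: 

$$\text{lions eat meat} \prec \text{mammals eat meat}$$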
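Both sentence meanings in this comparison are diagonal in the true/false basis, so the relative entropy reduces to the classical Kullback-Leibler sum. The sketch below relies on that simplification, takes logarithms to base 2 (an assumption, chosen because it yields the value 1 above), and uses the form $R = 1/(1 + N)$ for representativeness. The function names are hypothetical.

```python
import math


def relative_entropy_diagonal(p, q):
    """N(rho || sigma) in bits for two states diagonal in the same basis.

    p and q are the lists of diagonal entries. The result is infinite when
    rho has weight on a basis element where sigma has none.
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue  # 0 * log(0) is taken to be 0
        if q_i == 0:
            return math.inf
        total += p_i * (math.log2(p_i) - math.log2(q_i))
    return total


def representativeness(p, q):
    """R = 1 / (1 + N); an infinite relative entropy gives 0."""
    return 1.0 / (1.0 + relative_entropy_diagonal(p, q))


lions_eat_meat = [1.0, 0.0]      # |0><0|
mammals_eat_meat = [0.5, 0.5]    # completely mixed state

if __name__ == "__main__":
    print(representativeness(lions_eat_meat, mammals_eat_meat))
    print(representativeness(mammals_eat_meat, lions_eat_meat))
```

The asymmetry comes entirely from the support check: the mixed state has weight on false, where the pure state for “lions eat meat” has none.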


Since these two sentences share the same verb phrase, the from-meaning-of-words-to-the-meaning-of-sentence 
map carries the hyponymy relation between the subject words of the 
respective sentences over to the resulting sentence meanings. By using the density matrix 
representations of word meanings together with the categorical map from the meanings 
of words to the meanings of sentences, the knowledge that a lion is a mammal lets us 
infer that “mammals eat meat” implies “lions eat meat”: 

$$(\mathit{lions} \prec \mathit{mammals}) \to (\text{lions eat meat} \prec \text{mammals eat meat})$$
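“Dogs eat meat”. To see how the completely mixed state differs from a perfectly 
correlated but pure state in the context of linguistic meaning, consider a new 
noun $\mathit{dogs} = |\mathit{dogs}\rangle\langle \mathit{dogs}|$ and redefine eat in terms of the bases {lions, dogs} and 
{meat, plants}, so that it reflects the fact that dogs eat both meat and plants. We 
define “eat” so that it results in the value “half-true half-false” when it takes 
“dogs” as subject and “meat” or “plants” as object. The value “half-true half-false” is 
the superposition of true and false: $\frac{1}{\sqrt{2}}|0\rangle + \frac{1}{\sqrt{2}}|1\rangle$. With these assumptions, eat will still 
be a pure state with the following representation in FVect: 

$$\begin{aligned}
|\mathit{eat}\rangle ={} & |\mathit{lions}\rangle \otimes |0\rangle \otimes |\mathit{meat}\rangle + |\mathit{lions}\rangle \otimes |1\rangle \otimes |\mathit{plants}\rangle \\
& + |\mathit{dogs}\rangle \otimes \big(\tfrac{1}{\sqrt{2}}|0\rangle + \tfrac{1}{\sqrt{2}}|1\rangle\big) \otimes |\mathit{meat}\rangle + |\mathit{dogs}\rangle \otimes \big(\tfrac{1}{\sqrt{2}}|0\rangle + \tfrac{1}{\sqrt{2}}|1\rangle\big) \otimes |\mathit{plants}\rangle
\end{aligned}$$

Hence, the density matrix representation of “eat” becomes: 

$$\mathit{eat} = |\mathit{eat}\rangle\langle \mathit{eat}|$$

The calculation for the meaning of the sentence is as follows: 

$$\begin{aligned}
& (\epsilon^l_N \otimes 1_S \otimes \epsilon^r_N)(\mathit{dogs} \otimes \mathit{eat} \otimes \mathit{meat}) \\
&= (\epsilon^l_N \otimes 1_S \otimes \epsilon^r_N)(|\mathit{dogs}\rangle\langle \mathit{dogs}| \otimes |\mathit{eat}\rangle\langle \mathit{eat}| \otimes |\mathit{meat}\rangle\langle \mathit{meat}|) \\
&= \big(\tfrac{1}{\sqrt{2}}|0\rangle + \tfrac{1}{\sqrt{2}}|1\rangle\big)\big(\tfrac{1}{\sqrt{2}}\langle 0| + \tfrac{1}{\sqrt{2}}\langle 1|\big)
\end{aligned}$$

So in this case, we are certain that it is half-true and half-false that dogs eat meat. This 
is in contrast with the completely mixed state we got from “Mammals eat meat”, for 
which the truth or falsity of the sentence was entirely unknown. 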
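One way to make the contrast between the two states concrete is the purity $\mathrm{tr}(\rho^2)$, which equals 1 for a pure state and 1/2 for the completely mixed state in two dimensions. Purity is not a measure used in the text; it is brought in here only as an illustration. The sketch assumes that the “half-true half-false” superposition has unit norm, i.e. amplitudes $1/\sqrt{2}$.

```python
import math


def outer(v):
    """Density matrix |v><v| of a real vector v, as nested lists."""
    return [[a * b for b in v] for a in v]


def purity(rho):
    """tr(rho^2) for a square matrix given as nested lists."""
    n = len(rho)
    return sum(rho[i][k] * rho[k][i] for i in range(n) for k in range(n))


AMPLITUDE = 1 / math.sqrt(2)  # assumed normalization of the superposition

dogs_eat_meat = outer([AMPLITUDE, AMPLITUDE])   # pure superposition of true and false
mammals_eat_meat = [[0.5, 0.0], [0.0, 0.5]]     # completely mixed state from the earlier example

if __name__ == "__main__":
    print("purity of 'dogs eat meat':   ", purity(dogs_eat_meat))
    print("purity of 'mammals eat meat':", purity(mammals_eat_meat))
```

Both matrices have the same diagonal, so they assign the same weight to true and false. Only the off-diagonal entries distinguish certainty about a half-true value from ignorance about the truth value.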


“Mammals eat meat”, again. Let “mammals” now be defined as: 

$$\mathit{mammals} = \tfrac{1}{2}\mathit{lions} + \tfrac{1}{2}\mathit{dogs}$$

The calculation for the meaning of this sentence gives: 

$$\begin{aligned}
& (\epsilon^l_N \otimes 1_S \otimes \epsilon^r_N)(\mathit{mammals} \otimes \mathit{eat} \otimes \mathit{meat}) \\
&= \tfrac{1}{2}(\epsilon^l_N \otimes 1_S \otimes \epsilon^r_N)(\mathit{lions} \otimes \mathit{eat} \otimes \mathit{meat}) + \tfrac{1}{2}(\epsilon^l_N \otimes 1_S \otimes \epsilon^r_N)(\mathit{dogs} \otimes \mathit{eat} \otimes \mathit{meat}) \\
&= \tfrac{3}{4}|0\rangle\langle 0| + \tfrac{1}{4}|0\rangle\langle 1| + \tfrac{1}{4}|1\rangle\langle 0| + \tfrac{1}{4}|1\rangle\langle 1|
\end{aligned}$$

This time the resulting sentence representation is not completely mixed. This means 
that we can generalize the knowledge we have from the specific instances of mammals 
to the entire class to some extent, but still we cannot generalize completely. This is a 
mixed state, which indicates that even if the sentence is closer to true than to false, 
the degree of truth isn’t homogeneous throughout the elements of the class. The 
non-zero off-diagonal entries indicate that it is also partially correlated, which means that there 
are some instances of “mammals” for which this sentence is true to a degree, but not 
completely. The relative similarity measures between true and false and the sentence 
can be calculated explicitly using fidelity: 


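$$F(|1\rangle\langle 1|, \text{mammals eat meat}) = \langle 1|\,\text{mammals eat meat}\,|1\rangle = \tfrac{1}{4}$$

$$F(|0\rangle\langle 0|, \text{mammals eat meat}) = \langle 0|\,\text{mammals eat meat}\,|0\rangle = \tfrac{3}{4}$$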
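The mixture and the two fidelity values can be checked numerically. This sketch assumes that the “half-true half-false” state has amplitudes $1/\sqrt{2}$, and it computes fidelity with a pure state as the expectation value $\langle\psi|\rho|\psi\rangle$, which is the form used in the two equations above. The helper names are hypothetical.

```python
import math


def outer(v):
    """Density matrix |v><v| of a real vector v."""
    return [[a * b for b in v] for a in v]


def mix(weights, states):
    """Convex combination of equally sized density matrices."""
    n = len(states[0])
    return [
        [sum(w * s[i][j] for w, s in zip(weights, states)) for j in range(n)]
        for i in range(n)
    ]


def expectation(psi, rho):
    """<psi| rho |psi> for a real vector psi."""
    n = len(psi)
    return sum(psi[i] * rho[i][j] * psi[j] for i in range(n) for j in range(n))


AMPLITUDE = 1 / math.sqrt(2)  # assumed normalization

lions_eat_meat = outer([1.0, 0.0])             # |0><0|
dogs_eat_meat = outer([AMPLITUDE, AMPLITUDE])  # pure superposition
mammals_eat_meat = mix([0.5, 0.5], [lions_eat_meat, dogs_eat_meat])

fidelity_true = expectation([1.0, 0.0], mammals_eat_meat)
fidelity_false = expectation([0.0, 1.0], mammals_eat_meat)

if __name__ == "__main__":
    print(mammals_eat_meat)
    print(fidelity_true, fidelity_false)
```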


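Notice that these values are different from the values for the representativeness 
for truth and falsity of the sentence, even though the two measures are related: the more 
representative their density matrices, the more similar the sentences are to each other. 
For example, we have: 

$$\begin{aligned}
& N(|1\rangle\langle 1| \,\|\, \text{mammals eat meat}) \\
&= \mathrm{tr}\big(|1\rangle\langle 1| \log(|1\rangle\langle 1|)\big) - \mathrm{tr}\big(|1\rangle\langle 1| \log(\tfrac{3}{4}|0\rangle\langle 0| + \tfrac{1}{4}|0\rangle\langle 1| + \tfrac{1}{4}|1\rangle\langle 0| + \tfrac{1}{4}|1\rangle\langle 1|)\big) \\
&\approx 2
\end{aligned}$$

Hence, $R(|1\rangle\langle 1|, \text{mammals eat meat}) \approx 0.33$. On the other hand: 

$$\begin{aligned}
& N(|0\rangle\langle 0| \,\|\, \text{mammals eat meat}) \\
&= \mathrm{tr}\big(|0\rangle\langle 0| \log(|0\rangle\langle 0|)\big) - \mathrm{tr}\big(|0\rangle\langle 0| \log(\tfrac{3}{4}|0\rangle\langle 0| + \tfrac{1}{4}|0\rangle\langle 1| + \tfrac{1}{4}|1\rangle\langle 0| + \tfrac{1}{4}|1\rangle\langle 1|)\big) \\
&\approx 0.41
\end{aligned}$$

Hence, $R(|0\rangle\langle 0|, \text{mammals eat meat}) \approx 0.71$. 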
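Because the sentence matrix is no longer diagonal, the logarithm has to be taken in its eigenbasis. The sketch below does this for the comparison with $|0\rangle\langle 0|$, using a closed-form eigendecomposition for real symmetric 2x2 matrices. It relies on two assumptions: the numerical value depends on the base of the logarithm, which the text does not state, and the natural logarithm is used here; and the first argument is a pure state, so that $\mathrm{tr}(\rho \log \rho) = 0$. The function names are hypothetical.

```python
import math

EPS = 1e-12


def eigh_2x2(m):
    """Eigenvalues and orthonormal eigenvectors of a real symmetric 2x2 matrix."""
    a, b, d = m[0][0], m[0][1], m[1][1]
    if b == 0:
        return [a, d], [[1.0, 0.0], [0.0, 1.0]]
    mean, radius = (a + d) / 2, math.hypot((a - d) / 2, b)
    values = [mean + radius, mean - radius]
    vectors = []
    for lam in values:
        v = [b, lam - a]  # solves (m - lam * I) v = 0 when b != 0
        norm = math.hypot(v[0], v[1])
        vectors.append([v[0] / norm, v[1] / norm])
    return values, vectors


def relative_entropy_pure(psi, sigma):
    """N(|psi><psi| || sigma) with natural logarithms.

    For a pure first argument this is -<psi| log(sigma) |psi>.
    """
    values, vectors = eigh_2x2(sigma)
    total = 0.0
    for lam, v in zip(values, vectors):
        weight = (v[0] * psi[0] + v[1] * psi[1]) ** 2
        if weight < EPS:
            continue
        if lam < EPS:
            return math.inf  # psi overlaps the kernel of sigma
        total -= weight * math.log(lam)
    return total


def representativeness_pure(psi, sigma):
    return 1.0 / (1.0 + relative_entropy_pure(psi, sigma))


mammals_eat_meat = [[0.75, 0.25], [0.25, 0.25]]
TRUE = [1.0, 0.0]

if __name__ == "__main__":
    print(relative_entropy_pure(TRUE, mammals_eat_meat))
    print(representativeness_pure(TRUE, mammals_eat_meat))
```

Under these assumptions the sketch gives values close to the 0.41 and 0.71 reported for the comparison with true.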


7 A Distributional Example 

The goal of this section is to show how one can obtain density matrices for words using 
lexical taxonomies and co-occurrence frequencies counted from corpora of text. We 
show how these density matrices are used in example sentences and what the density 
matrices of their meanings look like. We compute the representativeness formula for 
these sentences to provide a proof of concept that this measure does make sense for 
data harvested distributionally from corpora and that its application is not restricted to 
truth-theoretic models. Implementing these constructions on real data and validating 
them on large scale datasets constitute work in progress. 


7.1 Entailment between nouns 

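Suppose we have a noun space N. Let the subspace relevant for this part of the example 
be spanned by the lemmas pub, pitcher, tonic. Assume that the (non-normalized version of 
the) vectors of the atomic words lager and ale in this subspace are as follows: 

$$\overrightarrow{\mathit{lager}} = 6 \times \overrightarrow{\mathit{pub}} + 5 \times \overrightarrow{\mathit{pitcher}} + 0 \times \overrightarrow{\mathit{tonic}} \qquad \overrightarrow{\mathit{ale}} = 7 \times \overrightarrow{\mathit{pub}} + 3 \times \overrightarrow{\mathit{pitcher}} + 0 \times \overrightarrow{\mathit{tonic}}$$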
Suppose further that we are given taxonomies such as ‘beer = lager + ale’, harvested 
from a resource such as WordNet. Atomic words (i.e. leaves of the taxonomy) 
correspond to pure states, and their density matrices are the projections onto the one 
dimensional subspace spanned by the word's vector. Non-atomic words (such as beer) are also 
density matrices, harvested from the corpus using a feature-based method similar to that 
of il If]. This is done by counting (and normalising) the number of times a word has 
co-occurred with a subset B of bases in a window in which other bases (the ones not in 
B) have not occurred. 

Formally, for a subset of bases $\{b_1, b_2, \ldots, b_n\}$, we collect co-ordinates $C_{ij}$ for each 
tuple $|b_i\rangle|b_j\rangle$ and build the density matrix $\sum_{ij} C_{ij}\,|b_i\rangle\langle b_j|$. 

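For example, suppose we see beer six times with just pub, seven times with both 
pub and pitcher, and never with tonic. Its corresponding density matrix will 
be as follows: 

$$\begin{aligned}
\mathit{beer} &= 6 \times |\mathit{pub}\rangle\langle \mathit{pub}| + 7 \times (|\mathit{pub}\rangle + |\mathit{pitcher}\rangle)(\langle \mathit{pub}| + \langle \mathit{pitcher}|) \\
&= 13 \times |\mathit{pub}\rangle\langle \mathit{pub}| + 7 \times |\mathit{pub}\rangle\langle \mathit{pitcher}| + 7 \times |\mathit{pitcher}\rangle\langle \mathit{pub}| + 7 \times |\mathit{pitcher}\rangle\langle \mathit{pitcher}|
\end{aligned}$$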
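The construction of the matrix for beer can be sketched as follows. The sketch assumes that each observed context contributes its count times the projector onto the sum of the basis vectors in that context, which is what the first line of the expansion above does. The names `density_from_cooccurrence` and `normalize` are hypothetical, and the counts are the ones given in the text.

```python
BASIS = ["pub", "pitcher", "tonic"]


def density_from_cooccurrence(counts):
    """Build a non-normalized density matrix from co-occurrence counts.

    counts maps a tuple of co-occurring basis words to a frequency. Each
    context adds count * |s><s|, where |s> is the sum of the basis vectors
    in that context.
    """
    n = len(BASIS)
    rho = [[0.0] * n for _ in range(n)]
    for context, count in counts.items():
        s = [1.0 if word in context else 0.0 for word in BASIS]
        for i in range(n):
            for j in range(n):
                rho[i][j] += count * s[i] * s[j]
    return rho


def normalize(rho):
    """Divide by the trace so that the result has trace 1."""
    trace = sum(rho[i][i] for i in range(len(rho)))
    return [[entry / trace for entry in row] for row in rho]


# beer: six times with just pub, seven times with both pub and pitcher
beer = density_from_cooccurrence({("pub",): 6, ("pub", "pitcher"): 7})

if __name__ == "__main__":
    for row in beer:
        print(row)
    for row in normalize(beer):
        print(row)
```

The row and column for tonic stay zero, which reflects that beer was never seen with tonic in this toy corpus.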


To calculate the similarity and representativeness of the word pairs, we first 
normalize them via the operation $\rho \mapsto \rho / \mathrm{Tr}(\rho)$, then apply the corresponding formulae. For 
example, the degree of similarity between ‘beer’ and ‘lager’ using fidelity is as follows: 

$$\mathrm{Tr}\sqrt{\mathit{lager}^{1/2} \cdot \mathit{beer} \cdot \mathit{lager}^{1/2}} = 0.93$$
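Since lager is an atomic word, its density matrix is a pure state $|\psi\rangle\langle\psi|$, and for a pure first argument the fidelity $\mathrm{Tr}\sqrt{\rho^{1/2} \sigma \rho^{1/2}}$ reduces to $\sqrt{\langle\psi|\sigma|\psi\rangle}$. The sketch below uses that shortcut instead of a general matrix square root, so it only covers the case where one argument is pure. The helper names are hypothetical, and the numbers are the lager vector and beer matrix from the text.

```python
import math


def normalize_vector(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def normalize_density(rho):
    """Divide by the trace so that the result has trace 1."""
    trace = sum(rho[i][i] for i in range(len(rho)))
    return [[entry / trace for entry in row] for row in rho]


def fidelity_with_pure(psi, sigma):
    """Tr sqrt(rho^(1/2) sigma rho^(1/2)) for rho = |psi><psi|, i.e. sqrt(<psi|sigma|psi>)."""
    n = len(psi)
    overlap = sum(psi[i] * sigma[i][j] * psi[j] for i in range(n) for j in range(n))
    return math.sqrt(overlap)


# Basis order: pub, pitcher, tonic
lager_vector = normalize_vector([6.0, 5.0, 0.0])
beer = normalize_density([[13.0, 7.0, 0.0], [7.0, 7.0, 0.0], [0.0, 0.0, 0.0]])

similarity = fidelity_with_pure(lager_vector, beer)

if __name__ == "__main__":
    print(round(similarity, 2))
```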


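The degree of entailment $\mathit{lager} \prec \mathit{beer}$ is 0.82, computed as follows: 

$$\frac{1}{1 + \mathrm{Tr}(\mathit{lager} \cdot \log(\mathit{lager}) - \mathit{lager} \cdot \log(\mathit{beer}))} = 0.82$$

The degree of entailment $\mathit{beer} \prec \mathit{lager}$ is 0, as one would expect. 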
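Both directions of this comparison can be sketched with a closed-form eigendecomposition. The sketch restricts attention to the pub/pitcher subspace, which is an assumption justified by both words having zero weight on tonic, and it takes logarithms to base 2, which is also an assumption since the text does not state the base. The function names are hypothetical.

```python
import math

EPS = 1e-12


def eigh_2x2(m):
    """Eigenvalues and orthonormal eigenvectors of a real symmetric 2x2 matrix."""
    a, b, d = m[0][0], m[0][1], m[1][1]
    if b == 0:
        return [a, d], [[1.0, 0.0], [0.0, 1.0]]
    mean, radius = (a + d) / 2, math.hypot((a - d) / 2, b)
    values = [mean + radius, mean - radius]
    vectors = []
    for lam in values:
        v = [b, lam - a]  # solves (m - lam * I) v = 0 when b != 0
        norm = math.hypot(v[0], v[1])
        vectors.append([v[0] / norm, v[1] / norm])
    return values, vectors


def relative_entropy(rho, sigma):
    """N(rho || sigma) = Tr(rho log rho) - Tr(rho log sigma), in bits."""
    rho_values, rho_vectors = eigh_2x2(rho)
    sigma_values, sigma_vectors = eigh_2x2(sigma)
    total = 0.0
    for p, u in zip(rho_values, rho_vectors):
        if p < EPS:
            continue  # directions outside the support of rho contribute nothing
        total += p * math.log2(p)
        for q, v in zip(sigma_values, sigma_vectors):
            overlap = (u[0] * v[0] + u[1] * v[1]) ** 2
            if overlap < EPS:
                continue
            if q < EPS:
                return math.inf  # support of rho meets the kernel of sigma
            total -= p * overlap * math.log2(q)
    return total


def representativeness(rho, sigma):
    return 1.0 / (1.0 + relative_entropy(rho, sigma))


def normalize(rho):
    trace = rho[0][0] + rho[1][1]
    return [[entry / trace for entry in row] for row in rho]


# Basis order: pub, pitcher
lager = normalize([[36.0, 30.0], [30.0, 25.0]])  # |lager><lager| with lager = (6, 5)
beer = normalize([[13.0, 7.0], [7.0, 7.0]])

if __name__ == "__main__":
    print(round(representativeness(lager, beer), 2))
    print(representativeness(beer, lager))
```

The reverse direction is zero because beer has rank two while lager has rank one: part of the support of beer lies in the kernel of lager, so the relative entropy is infinite.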


7.2 Entailment between sentences 

To see how the entailment between sentences follows from the entailment between 
words, consider the example sentences ‘Psychiatrist is drinking lager’ and ‘Doctor is 
drinking beer’. For the sake of brevity, we assume the meanings of psychiatrist and 
doctor are mixtures of basis elements, as follows: 

$$\mathit{psychiatrist} = 2 \times |\mathit{patient}\rangle\langle \mathit{patient}| + 5 \times |\mathit{mental}\rangle\langle \mathit{mental}|$$

$$\mathit{doctor} = 5 \times |\mathit{patient}\rangle\langle \mathit{patient}| + 2 \times |\mathit{mental}\rangle\langle \mathit{mental}| + 3 \times |\mathit{surgery}\rangle\langle \mathit{surgery}|$$

The similarity between psychiatrist and doctor is: 

$$S(\mathit{psychiatrist}, \mathit{doctor}) = S(\mathit{doctor}, \mathit{psychiatrist}) = 0.76$$

The representativeness between them is: 

$$R(\mathit{psychiatrist}, \mathit{doctor}) = 0.49 \qquad R(\mathit{doctor}, \mathit{psychiatrist}) = 0$$
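Both nouns are diagonal in the basis patient, mental, surgery, so fidelity and relative entropy reduce to their classical counterparts: the Bhattacharyya coefficient $\sum_i \sqrt{p_i q_i}$ and the Kullback-Leibler sum. The sketch below relies on that reduction and on base-2 logarithms, which is an assumption. Under it, the similarity comes out near 0.76 and the representativeness of psychiatrist for doctor near 0.48, close to the reported 0.49. The function names are hypothetical.

```python
import math


def normalize(weights):
    """Turn non-negative weights into a probability distribution."""
    total = sum(weights)
    return [w / total for w in weights]


def fidelity_diagonal(p, q):
    """Tr sqrt(rho^(1/2) sigma rho^(1/2)) for commuting diagonal states."""
    return sum(math.sqrt(p_i * q_i) for p_i, q_i in zip(p, q))


def relative_entropy_diagonal(p, q):
    """N(rho || sigma) in bits for diagonal states; infinite if rho leaves the support of sigma."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue
        if q_i == 0:
            return math.inf
        total += p_i * (math.log2(p_i) - math.log2(q_i))
    return total


def representativeness(p, q):
    return 1.0 / (1.0 + relative_entropy_diagonal(p, q))


# Basis order: patient, mental, surgery
psychiatrist = normalize([2.0, 5.0, 0.0])
doctor = normalize([5.0, 2.0, 3.0])

if __name__ == "__main__":
    print(round(fidelity_diagonal(psychiatrist, doctor), 2))
    print(round(representativeness(psychiatrist, doctor), 2))
    print(representativeness(doctor, psychiatrist))
```

The reverse direction vanishes because doctor has weight on surgery, where psychiatrist has none.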
We build matrices for the verb drink following the method of Q. Intuitively this is 
as follows: the value in entry (i, j) of this matrix will reflect how typical it is for the 
verb to have a subject related to the ith basis and an object related to the jth basis. We 
assume that the small part of the matrix that interests us for this example is as follows: 


drink      pub   pitcher   tonic
patient     4       5        3
mental      6       3        2
surgery     1       2        1


This representation can be seen as a pure state living in a second order 
tensor. Therefore the density matrix representation of the same object is 
$\mathit{drink} = |\mathit{drink}\rangle\langle \mathit{drink}|$, a fourth order tensor. Lifting the simplifications introduced in [14] 
from vectors to density matrices, we obtain the following linear algebraic closed forms 
for the meaning of the sentences: 

$$\text{Psychiatrist is drinking lager} = \mathit{drink} \odot (\mathit{psychiatrist} \otimes \mathit{lager})$$

$$\text{Doctor is drinking beer} = \mathit{drink} \odot (\mathit{doctor} \otimes \mathit{beer})$$
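The closed form for the first sentence can be sketched as follows. The sketch makes several assumptions that the text does not spell out: the operator between the verb and the noun pair is read as the entrywise (Hadamard) product, the verb matrix is flattened row by row so that index `subject * 3 + object` matches the Kronecker product ordering, and every matrix is normalized to trace 1 before and after composition. It only constructs the sentence matrix and does not attempt to reproduce the similarity and representativeness values. All helper names are hypothetical.

```python
SUBJECT_BASIS = ["patient", "mental", "surgery"]
OBJECT_BASIS = ["pub", "pitcher", "tonic"]

# Rows: subject basis, columns: object basis (values from the table in the text).
DRINK = [[4.0, 5.0, 3.0], [6.0, 3.0, 2.0], [1.0, 2.0, 1.0]]


def outer(v):
    """Density matrix |v><v| of a real vector v."""
    return [[a * b for b in v] for a in v]


def diagonal(weights):
    """Diagonal matrix with the given weights."""
    n = len(weights)
    return [[weights[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


def normalize(rho):
    """Divide by the trace so that the result has trace 1."""
    trace = sum(rho[i][i] for i in range(len(rho)))
    return [[entry / trace for entry in row] for row in rho]


def kron(a, b):
    """Kronecker product of two square matrices."""
    n, m = len(a), len(b)
    size = n * m
    return [[a[i // m][j // m] * b[i % m][j % m] for j in range(size)] for i in range(size)]


def hadamard(a, b):
    """Entrywise product of two matrices of the same shape."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


psychiatrist = normalize(diagonal([2.0, 5.0, 0.0]))
lager = normalize(outer([6.0, 5.0, 0.0]))

# Flatten the verb matrix row by row: index = subject * 3 + object.
drink_vector = [entry for row in DRINK for entry in row]
drink = normalize(outer(drink_vector))

sentence = normalize(hadamard(drink, kron(psychiatrist, lager)))

if __name__ == "__main__":
    print("trace:", sum(sentence[i][i] for i in range(9)))
    print("weight on (patient, pub):", sentence[0][0])
```

Because psychiatrist is diagonal, the resulting 9x9 matrix is block diagonal over the subject basis, and the block for surgery is zero since psychiatrist has no weight there.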


Applying the fidelity and representativeness formulae to sentence representations, we 
obtain the following values: 

$$S(\text{Psychiatrist is drinking lager}, \text{Doctor is drinking beer}) = 0.81$$

$$R(\text{Psychiatrist is drinking lager}, \text{Doctor is drinking beer}) = 0.53$$

$$R(\text{Doctor is drinking beer}, \text{Psychiatrist is drinking lager}) = 0$$


From the relations $\mathit{psychiatrist} \prec \mathit{doctor}$ and $\mathit{lager} \prec \mathit{beer}$ we obtain the desired 
entailment between sentences: 

$$\text{Psychiatrist is drinking lager} \prec \text{Doctor is drinking beer}.$$

The entailment between these two sentences follows from the entailment between 
their subjects and the entailment between their objects. In the examples that we have 
considered so far, the verbs of the sentences are the same. This is not a necessity. One can 
have entailment between sentences that do not have the same verbs, but where the verbs 
entail each other; examples can be found in ||2l. We do not present such cases 
here for lack of space. 


8 Conclusion and Future Work 

The often stated long term goal of compositional distributional models is to merge 
distributional and formal semantics. However, what formal and distributional semantics 
do with the resulting meaning representations is quite different. Distributional 
semanticists care about similarity while formal semanticists aim to capture truth and 
inference. In this work we presented a theory of meaning using basic objects that 
will not confine us to the realm of only distributional or only formal semantics. The 
immediate next step is to develop methods for obtaining density matrix representations 
of words from corpora that are more robust to statistical noise, and to test the usefulness 
of the theory in large scale experiments. 

The problem of integrating function words such as ‘and’, ‘or’, ‘not’, ‘every’ into 
a distributional setting has been notoriously hard. We hope that the characterization of 
compositional distributional entailment on these very simple types of sentences will 
provide a foundation on which we can define representations of these function words, 
and develop a more logical theory of compositional distributional meaning. 